Adaptive learning system and method

ABSTRACT

A neural network module including an input layer having one or more input nodes arranged to receive input data, a rule base layer having one or more rule nodes, an output layer having one or more output nodes, and an adaptive component arranged to aggregate selected two or more rule nodes in the rule base layer based on the input data, an adaptive learning system having one or more of the neural network modules, related methods of implementing the neural network module and an adaptive learning system, and a neural network program.

FIELD OF INVENTION

The invention relates to an adaptive learning system and method and in particular relates to a neural network module forming part of an adaptive learning system.

BACKGROUND TO INVENTION

Real world problems, such as massive biological data analysis and knowledge discovery, adaptive speech recognition and life-long language acquisition, adaptive intelligent prediction and control systems, intelligent agent-based systems and adaptive agents on the Web, mobile robots, visual monitoring systems, multi-modal information processing, intelligent adaptive decision support systems, adaptive domestic appliances and intelligent buildings, systems that learn and control brain and body states from a biofeedback, systems which classify bio-informatic data, and other systems require sophisticated solutions for building on-line adaptive knowledge base systems.

Such systems should be able to learn quickly from a large amount of data, adapt incrementally in an on-line mode, have an open structure so as to allow dynamic creation of new modules, memorise information that can be used later, interact continuously with the environment in a “life-long” learning mode, deal with knowledge as well as with data, and adequately represent space and time in their structure.

Well established neural network and artificial intelligence (AI) techniques have difficulties when applied to on-line knowledge-based learning. For example, multi-layer perceptrons (MLP) and backpropagation learning algorithms have a number of problems, including catastrophic forgetting, the local minima problem, difficulties in extracting rules, inability to adapt to new data without retraining on old data, and excessive training times when applied to large data sets.

The self-organising map (SOM) may not be efficient when applied for unsupervised adaptive learning on new data, as the SOM assumes a fixed structure and a fixed grid of nodes connected in a topological output space that may not be appropriate to project a particular data set. Radial basis neural networks require clustering to be performed first and then the back propagation algorithm applied. Neuro-fuzzy systems cannot update the learned rules through continuous training on additional data without catastrophic forgetting.

These types of networks are not efficient for adaptive, on-line learning, although they do provide an improvement over prior techniques.

SUMMARY OF INVENTION

In one form the invention comprises a neural network module comprising an input layer comprising one or more input nodes arranged to receive input data; a rule base layer comprising one or more rule nodes; an output layer comprising one or more output nodes; and an adaptive component arranged to aggregate selected two or more rule nodes in the rule base layer based on the input data.

In another form the invention comprises a method of implementing a neural network module comprising the steps of arranging an input layer comprising one or more input nodes to receive input data; arranging a rule base layer comprising one or more rule nodes; arranging an output layer comprising one or more output nodes; and arranging an adaptive component to aggregate selected two or more rule nodes in the rule base layer based on the input data.

In a further form the invention comprises a neural network computer program comprising an input layer comprising one or more input nodes arranged to receive input data; a rule base layer comprising one or more rule nodes; an output layer comprising one or more output nodes; and an adaptive component arranged to aggregate selected two or more rule nodes in the rule base layer based on the input data.

BRIEF DESCRIPTION OF THE FIGURES

Preferred forms of the adaptive learning system and method will now be described with reference to the accompanying figures in which:

FIG. 1 is a schematic view of hardware on which one form of the invention may be implemented;

FIG. 2 is a further schematic view of an adaptive learning system of the invention;

FIG. 3 is a schematic view of a neural network module of FIG. 2;

FIG. 4 is an example of membership functions for use with the invention;

FIG. 5 is an example of a rule node of the invention;

FIG. 6 illustrates the adjustment and learning process relating to the rule node of FIG. 5;

FIG. 7 shows an adaptive learning system of the invention having three rule nodes;

FIG. 8 shows one method of aggregating the rule nodes of FIG. 7;

FIG. 9 illustrates another method of aggregating the three rule nodes of FIG. 7;

FIGS. 10 and 11 illustrate the aggregation of spatially allocated rule nodes;

FIGS. 12 and 13 illustrate the aggregation of linearly allocated rule nodes;

FIGS. 14 to 17 illustrate different allocation strategies for new rule nodes;

FIGS. 18A and 18B illustrate the system learning a complex time series chaotic function;

FIG. 19 is a table of selected rules extracted from a system trained on the function of FIG. 18;

FIGS. 20 and 21 illustrate the system learning from time series data examples;

FIGS. 22 and 23 illustrate unsupervised continuous learning by the system;

FIG. 24 illustrates evolved rule nodes and the trajectory of a spoken word ‘zoo’ in the two dimensional space of the first two principal components in a system trained with a mix of spoken words in NZ English and Maori;

FIG. 25 illustrates comparative analysis of the learning model of the system with other models;

FIG. 26 is a table showing global test accuracy of a known method compared with the system of the invention;

FIG. 27 illustrates a rule from a set of rules extracted from an evolved system from a sequence of biological data for the identification of a splice junction between introns and exons in a gene; and

FIG. 28 illustrates a rule from a set of rules extracted from an evolved system from micro-array gene expression data taken from two types of leukaemia cancer tissue, ALL and AML.

DETAILED DESCRIPTION OF PREFERRED FORMS

FIG. 1 illustrates preferred form hardware on which one form of the invention may be implemented. The preferred system 2 comprises a data processor 4 interfaced to a main memory 6, the processor 4 and the memory 6 operating under the control of appropriate operating and application software or hardware. The processor 4 could be interfaced to one or more input devices 8 and one or more output devices 10 with an I/O controller 12. The system 2 may further include suitable mass storage devices 14, for example floppy, hard disk or CD-ROM drives or DVD apparatus, a screen display 16, a pointing device 17, a modem 18 and/or network controller 19. The various components could be connected via a system bus or over a wired or wireless network.

In one form the invention could be arranged for use in speech recognition and to be trained on model speech signals. In this form, the input device(s) 8 could comprise a microphone and/or a further storage device in which audio signals or representations of audio signals are stored. The output device(s) 10 could comprise a printer for displaying the speech or language processed by the system, and/or a suitable speaker for generating sound. Speech or language could also be displayed on display device 16.

Where the invention is arranged to classify bio-informatics case study data, this data could be stored in a mass storage device 14, accessed by the processor 4 and the results displayed on a screen display 16 and/or a further output device 10.

Where the system 2 is arranged for use with a mobile robot, the input device(s) 8 could include sensors or other apparatus arranged to form representations of an environment. The input devices could also include secondary storage in which a representation of an environment is stored. The output device(s) 10 could include a monitor or visual display unit to display the environment processed by the system. The processor 4 could also be interfaced to motor control means to transport the robot from one location in the processed environment to another location.

It will be appreciated that the adaptive learning system 2 could be arranged to operate in many different environments and to solve many different problems. In each case, the system 2 evolves its structure and functionality over time through interaction with the environment through the input devices 8 and the output devices 10.

FIG. 2 illustrates the computer-implemented aspects of the invention stored in memory 6 and/or mass storage 14 and arranged to operate with processor 4. The preferred system is arranged as an evolving connectionist system 20. The system 20 is provided with one or more neural network modules or NNM 22. The arrangement and operation of the neural network module(s) 22 forms the basis of the invention and will be further described below.

The system includes a representation or memory component 26 comprising one or more neural network modules 22. The representation component 26 preferably includes an adaptation component 28, as will be particularly described below, which enables rule nodes to be inserted, extracted and/or aggregated.

The system 20 may include a number of further known components, for example a feature selection component 24 arranged to perform filtering of the input information, feature extraction and formation of the input vectors.

The system may also include a higher level decision component 30 comprising one or more modules which receive feedback from the environment 34, an action component 32 comprising one or more modules which take output values from the decision component and pass output information to the environment 34, and a knowledge base 36 which is arranged to extract compressed abstract information from the representation component 26 and the decision component 30 in the form of rules, abstract associations and other information. The knowledge base 36 may use techniques such as genetic algorithms or other evolutionary computation techniques to evaluate and optimise the parameters of the system during its operation.

FIG. 3 illustrates one preferred form of neural network module 22. The preferred structure is a fuzzy neural network, which is a connectionist structure which implements fuzzy rules. The neural network module 22 includes input layer 40 having one or more input nodes 42 arranged to receive input data.

The neural network module 22 may further comprise fuzzy input layer 44 having one or more fuzzy input nodes 46. The fuzzy input nodes 46 transform data from the input nodes 42 for the further use of the system. Each of the fuzzy input nodes 46 could have a different membership function attached to it. One example of a membership function is the triangular membership function shown in FIG. 4. The membership function could also include Gaussian functions or any other known functions suitable for the purpose. The system is preferably arranged so that the number and type of the membership functions may be dynamically modified as will be described further below. The main purpose of the fuzzy input nodes 46 is to transform the input values from the input nodes 42 into membership degrees, that is, the degrees to which the values belong to each membership function.
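The transformation of an input value into membership degrees can be sketched as follows. This is a minimal illustration only, assuming evenly overlapping triangular membership functions; the function name `triangular_memberships`, the peak positions and the saturating "shoulders" outside the outer peaks are assumptions for illustration and are not taken from FIG. 4.

```python
def triangular_memberships(value, centres):
    """Return the membership degree of `value` in each triangular MF.

    `centres` lists the peak of each membership function in ascending
    order; neighbouring triangles overlap so that the degrees sum to 1.
    """
    degrees = []
    last = len(centres) - 1
    for i, c in enumerate(centres):
        if value == c:
            d = 1.0
        elif i > 0 and centres[i - 1] < value < c:
            # rising edge between the previous peak and this peak
            d = (value - centres[i - 1]) / (c - centres[i - 1])
        elif i < last and c < value < centres[i + 1]:
            # falling edge between this peak and the next peak
            d = (centres[i + 1] - value) / (centres[i + 1] - c)
        elif (i == 0 and value < c) or (i == last and value > c):
            d = 1.0  # assumed shoulder: saturate outside the outer peaks
        else:
            d = 0.0
        degrees.append(d)
    return degrees


# Hypothetical Small / Medium / Large functions peaking at 0, 0.5 and 1.
small, medium, large = triangular_memberships(0.25, [0.0, 0.5, 1.0])
```

With these assumed peaks, an input of 0.25 belongs equally to Small and Medium and not at all to Large.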

The neural network module 22 further comprises rule base layer 48 having one or more rule nodes 50. Each rule node 50 is defined by two vectors of connection weights W1(r) and W2(r). Connection weight W1(r) is preferably adjusted through unsupervised learning based on a similarity measure within a local area of the problem space. W2(r), on the other hand, is preferably adjusted through supervised learning based on output error, or on reinforcement learning based on output hints. Connection weights W1(r) and W2(r) are further described below.

The neural network module 22 may further comprise a fuzzy output layer 52 having one or more fuzzy output nodes 54. Each fuzzy output node 54 represents a fuzzy quantisation of the output variables, similar to the fuzzy input nodes 46 of the fuzzy input layer 44. Preferably, a weighted sum input function and a saturated linear activation function are used for the nodes to calculate the membership degrees to which the output vector associated with the presented input vector belongs to each of the output membership functions.

The neural network module also includes output layer 56 having one or more output nodes 58. The output nodes 58 represent the real values of the output variables. Preferably a linear activation function is used to calculate the de-fuzzified values for the output variables.

The preferred form rule base layer 48 comprises one or more rule nodes 50 representing prototypes of input-output data associations that can be graphically represented as associations of hyper-spheres from the fuzzy input layer 44 spaces and the fuzzy output layer 52 spaces. Each rule node 50 has a minimum activation threshold which is preferably determined by a linear activation function.

As shown in FIG. 3, the neural network module 22 may also include a short-term memory layer 60 having one or more memory nodes 62. The purpose of the short-term memory layer 60 is to memorise structurally temporal relationships of the input data. The short-term memory layer is preferably arranged to receive information from and send information to the rule base layer 48.

As described above, each rule node 50 represents an association between a hyper-sphere from the fuzzy input space and a hyper-sphere from the fuzzy output space. These spheres are described with reference to FIG. 5, which illustrates example rule node 70 shown as r_(j). Rule node r_(j) has an initial hyper-sphere 72 in the fuzzy input space. The rule node r_(j) has a sensitivity threshold parameter S_(j) which defines the minimum activation threshold of the rule node r_(j) to a new input vector x from a new example or input (x,y) in order for the example to be considered for association with this rule node. A new input vector x activates a rule node if x satisfies the minimum activation threshold and is subsequently considered for association with the rule node. The radius of the input hyper-sphere 72 is defined as R_(j)=1−S_(j), S_(j) being the sensitivity threshold parameter.

Rule node r_(j) has a matrix of connection weights W1(r_(j)) which represents the coordinates of the centre of the sphere 72 in the fuzzy input space. Rule node r_(j) also has a fuzzy output space hyper-sphere 74, the coordinates of the centre of the sphere 74 being connection weights W2(r_(j)). The radius of the output hyper-sphere 74 is defined as E, which represents the error threshold or error tolerance of the rule node 70. In this way it is possible for some rule nodes to be activated more strongly than other rule nodes by input data.

A new pair of data vectors (x,y) is transformed to fuzzy input/output data vectors (x_(f), y_(f)) which will be allocated to the rule node 70 if x_(f) falls within input hyper-sphere 72 and y_(f) falls within the output hyper-sphere 74 when the input vector x is propagated through the input node. The distance of x_(f) from the centre of input hyper-sphere 72 and the distance of y_(f) from the centre of output hyper-sphere 74 provide a basis for calculating and assigning the magnitude or strength of activation. This strength of activation provides a basis for comparing the strengths of activation of different rule nodes. Therefore a further basis for allocation is where the rule node 70 receives the strongest activation among other rule nodes. The data vectors (x_(f), y_(f)) will be associated with rule node 70 if the local normalised fuzzy difference between x_(f) and W1(r_(j)) is smaller than the radius R_(j), and the normalised output error Err=∥y−y′∥/Nout is smaller than an error threshold E, where Nout is the number of outputs and y′ is the output produced by the system. The E parameter sets the error tolerance of the system.

In the preferred method a local normalised fuzzy difference (distance) between two fuzzy membership vectors d_(1f) and d_(2f), which represent the membership degrees to which two real vector data d₁ and d₂ belong to pre-defined MFs, is calculated as:

D(d_(1f), d_(2f)) = ∥d_(1f) − d_(2f)∥ / ∥d_(1f) + d_(2f)∥  (1)

where: ∥x−y∥ denotes the sum of all the absolute values of a vector that is obtained after vector subtraction (or summation in case of ∥x+y∥) of two vectors x and y; “/” denotes division. For example, if d_(1f)=(0, 0, 1, 0, 0, 0) and d_(2f)=(0, 1, 0, 0, 0, 0), then D(d_(1f), d_(2f))=(1+1)/2=1, which is the maximum value for the local normalised fuzzy difference.
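Formula (1) and its worked example can be checked with a short sketch. The function name `fuzzy_distance` is chosen here for illustration, and returning 0 when both vectors are all zero is an assumption, since the text does not define that case.

```python
def fuzzy_distance(d1f, d2f):
    """Local normalised fuzzy distance between two membership vectors.

    Numerator: sum of absolute element-wise differences.
    Denominator: sum of absolute element-wise sums.
    """
    numerator = sum(abs(a - b) for a, b in zip(d1f, d2f))
    denominator = sum(abs(a + b) for a, b in zip(d1f, d2f))
    if denominator == 0:
        return 0.0  # assumption: two all-zero vectors are treated as identical
    return numerator / denominator


# The example from the text: two non-overlapping membership vectors.
print(fuzzy_distance((0, 0, 1, 0, 0, 0), (0, 1, 0, 0, 0, 0)))  # 1.0
```

Identical vectors give a distance of 0, and vectors with no membership in common give the maximum value of 1.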

As new inputs are fed to rule node 70, the data inputs relevant to r_(j) may be associated with rule node 70, providing an opportunity for learning. As a new fuzzy input/output data vector (x_(f), y_(f)) is fed to the rule node 70, the centre of the input hyper-sphere 72 is adjusted to a new sphere indicated at 72A by adjusting W1(r_(j)⁽¹⁾) to W1(r_(j)⁽²⁾). The output hyper-sphere 74 is also adjusted to a new sphere as shown at 74A by adjusting W2(r_(j)⁽¹⁾) to W2(r_(j)⁽²⁾).

The centres of the node hyper-spheres are adjusted in the fuzzy input space depending on the distance between the new input vector and the rule node through a learning rate l_(j), a parameter that is individually adjusted for each rule node. The adjustment of the hyper-spheres in the fuzzy output space depends on the output error and also on the learning rate l_(j) through the Widrow-Hoff LMS algorithm, also called the Delta algorithm.

This adjustment in the input and in the output spaces can be represented mathematically by the change in the connection weights of the rule node r_(j) from W1(r_(j)⁽¹⁾) and W2(r_(j)⁽¹⁾) to W1(r_(j)⁽²⁾) and W2(r_(j)⁽²⁾) respectively according to the following vector operations:

W1(r_(j)⁽²⁾) = W1(r_(j)⁽¹⁾) + l_(j).(W1(r_(j)⁽¹⁾) − x_(f))  (2)

W2(r_(j)⁽²⁾) = W2(r_(j)⁽¹⁾) + l_(j).(A2 − y_(f)).A1(r_(j)⁽¹⁾)  (3)

where: A2=f₂(W2.A1) is the activation vector of the fuzzy output neurons when the input vector x is presented; A1(r_(j)⁽¹⁾)=f₁(D(W1(r_(j)⁽¹⁾), x_(f))) is the activation of the rule node r_(j)⁽¹⁾; a simple linear function can be used for f₁ and f₂, e.g. A1(r_(j)⁽¹⁾)=1−D(W1(r_(j)⁽¹⁾), x_(f)), where D is the fuzzy normalised distance measure; l_(j) is the current learning rate of the rule node r_(j) calculated as l_(j)=1/Nex(r_(j)), where Nex(r_(j)) is the number of examples currently associated with rule node r_(j). The statistical rationale behind this is that the more examples that are currently associated with a rule node, the less it will “move” when a new example has to be accommodated by this rule node, i.e. the change in the rule node position is inversely proportional to the number of already associated examples, which is a statistical characteristic of the method.
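The effect of the learning rate l_(j)=1/Nex(r_(j)) can be illustrated with a sketch of one update for a single rule node. Several points here are assumptions rather than the stated method: the corrections are written as (x_(f) − W1) and (y_(f) − A2), the usual delta-rule orientation, so that the centres move toward the new example as the text and FIG. 6 describe, whereas the printed formulae show the operands in the opposite order; f₂ is taken to be a saturated linear function clipped to [0, 1]; and the names `update_rule_node` and `n_examples` are hypothetical.

```python
def fuzzy_distance(a, b):
    """Local normalised fuzzy distance, as in formula (1)."""
    den = sum(abs(p + q) for p, q in zip(a, b))
    return sum(abs(p - q) for p, q in zip(a, b)) / den if den else 0.0


def update_rule_node(w1, w2, x_f, y_f, n_examples):
    """Return the adjusted (W1, W2) of one rule node for one example."""
    lr = 1.0 / n_examples                 # l_j = 1 / Nex(r_j)
    a1 = 1.0 - fuzzy_distance(w1, x_f)    # linear f1: rule node activation
    # Assumed saturated linear f2, evaluated for this node only.
    a2 = [min(1.0, max(0.0, w * a1)) for w in w2]
    # Assumed sign: move the input centre toward x_f.
    new_w1 = [w + lr * (x - w) for w, x in zip(w1, x_f)]
    # Assumed sign: delta rule on the output error, scaled by A1.
    new_w2 = [w + lr * (y - a) * a1 for w, y, a in zip(w2, y_f, a2)]
    return new_w1, new_w2


new_w1, new_w2 = update_rule_node([1.0, 0.0], [1.0, 0.0],
                                  [0.5, 0.5], [1.0, 0.0], n_examples=2)
```

With two associated examples the input centre moves halfway toward the new example; with more associated examples the same input would move it proportionally less.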

When a new example is associated with a rule node r_(j), not only does its location in the input space change, but so do its receptive field, expressed as its radius Rj, and its sensitivity threshold Sj, according to (4) and (5) respectively:

Rj⁽²⁾ = Rj⁽¹⁾ + D(W1(r_(j)⁽²⁾), W1(r_(j)⁽¹⁾)), Rj⁽²⁾ <= Rmax  (4)

Sj⁽²⁾ = Sj⁽¹⁾ − D(W1(r_(j)⁽²⁾), W1(r_(j)⁽¹⁾))  (5)

where Rmax is a parameter set to restrict the maximum radius of the receptive field of a rule node.
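A sketch of the receptive field adjustment of formulae (4) and (5) follows. It assumes that the relation R_(j)=1−S_(j) continues to hold when the radius is capped at Rmax, which the text does not state explicitly; `grow_receptive_field` is a hypothetical name.

```python
def fuzzy_distance(a, b):
    """Local normalised fuzzy distance, as in formula (1)."""
    den = sum(abs(p + q) for p, q in zip(a, b))
    return sum(abs(p - q) for p, q in zip(a, b)) / den if den else 0.0


def grow_receptive_field(radius, w1_old, w1_new, r_max):
    """Return the new (radius, sensitivity threshold) after a centre shift."""
    shift = fuzzy_distance(w1_new, w1_old)
    new_radius = min(radius + shift, r_max)   # formula (4), capped at Rmax
    # Assumption: keep S = 1 - R, which equals formula (5) when uncapped.
    return new_radius, 1.0 - new_radius


radius, sensitivity = grow_receptive_field(0.1, [1.0, 0.0], [0.75, 0.25], r_max=0.5)
```

In this example the centre shift is 0.25, so the radius grows from 0.1 to 0.35 and the sensitivity threshold falls from 0.9 to 0.65.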

The adjustment and learning process in the fuzzy input space is illustrated in FIG. 6, which schematically illustrates how the centre r_(j)⁽¹⁾ 82 of the rule node r_(j) 80 adjusts, after learning each new data point, to its new position r_(j)⁽⁴⁾ 84 based on one pass learning on the four data points d₁, d₂, d₃ and d₄.

The adaptation component of the preferred system enables rule nodes to be inserted, extracted and adapted or aggregated as will be described below. At any time or phase of the evolving or learning process, fuzzy or exact rules may be inserted by setting a new rule node r_(j) for each new rule, such that the connection weights W1(r_(j)) and W2(r_(j)) of the rule node represent this rule.

For example, the fuzzy rule (IF x₁ is Small and x₂ is Small THEN y is Small) may be inserted into the neural network module 22 by setting the connections of a new rule node to the fuzzy condition nodes x₁-Small and x₂-Small and to the fuzzy output node y-Small to a value of 1 each. The rest of the connections are set to a value of 0.
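The insertion of this fuzzy rule can be sketched as building the two weight vectors directly. The node ordering, the use of three labels (Small, Medium, Large) per variable and the name `insert_fuzzy_rule` are assumptions made for illustration.

```python
def insert_fuzzy_rule(input_nodes, output_nodes, antecedent, consequent):
    """Build (W1, W2) for a new rule node representing one fuzzy rule.

    Connections to the fuzzy nodes named in the rule are set to 1,
    all remaining connections to 0.
    """
    w1 = [1.0 if node in antecedent else 0.0 for node in input_nodes]
    w2 = [1.0 if node in consequent else 0.0 for node in output_nodes]
    return w1, w2


# Assumed layout: three fuzzy nodes per variable.
labels = ("Small", "Medium", "Large")
input_nodes = [(var, label) for var in ("x1", "x2") for label in labels]
output_nodes = [("y", label) for label in labels]

# IF x1 is Small and x2 is Small THEN y is Small
w1, w2 = insert_fuzzy_rule(
    input_nodes, output_nodes,
    antecedent={("x1", "Small"), ("x2", "Small")},
    consequent={("y", "Small")},
)
```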

Similarly, an exact rule may be inserted into the module 22, for example IF x₁ is 3.4 and x₂ is 6.7 THEN y is 9.5. Here, the membership degrees to which the input values x₁=3.4 and x₂=6.7 and the output value y=9.5 belong to the corresponding fuzzy values are calculated and attached to the corresponding connection weights.

The preferred adaptation component also permits rule extraction in which new rules and relationships are identified by the system. Each rule node r_(j) can be expressed as a fuzzy rule, for example:

Rule r: IF x₁ is Small 0.85 and x₁ is Medium 0.15 and x₂ is Small 0.7 and x₂ is Medium 0.3 {radius of the receptive field of the rule r is 0.5}

THEN y is Small 0.2 and y is Large 0.8 {Nex(r) examples associated with this rule out of Nsum total examples learned by the system}.

The numbers attached to the fuzzy labels denote the degree to which the centres of the input and the output hyper-spheres belong to the respective membership functions.
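Reading a rule out of the connection weights of a rule node can be sketched as below. The text mentions thresholds T₁ and T₂ for rule extraction without defining them; this sketch assumes they are cut-offs below which a W1 or W2 connection is left out of the rule. The function name and node layout are hypothetical.

```python
def extract_rule(w1, w2, input_nodes, output_nodes, t1=0.1, t2=0.1):
    """Express one rule node as a fuzzy IF-THEN rule string."""
    def clause(weights, nodes, threshold):
        # Keep only connections at or above the (assumed) threshold.
        return " and ".join(
            f"{var} is {label} {w:g}"
            for w, (var, label) in zip(weights, nodes)
            if w >= threshold
        )
    return f"IF {clause(w1, input_nodes, t1)} THEN {clause(w2, output_nodes, t2)}"


labels = ("Small", "Medium", "Large")
input_nodes = [(var, label) for var in ("x1", "x2") for label in labels]
output_nodes = [("y", label) for label in labels]

rule = extract_rule([0.85, 0.15, 0.0, 0.7, 0.3, 0.0], [0.2, 0.0, 0.8],
                    input_nodes, output_nodes)
print(rule)
```

For these weights the sketch reproduces the condition and conclusion parts of the example rule above; the radius and Nex(r) annotations would be appended from the node's other parameters.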

The adaptation component preferably also permits rule node aggregation. Through this technique, several rule nodes are merged into one, as is shown in FIGS. 7, 8 and 9 for an example of three rule nodes r₁, r₂ and r₃.

FIG. 7 illustrates a neural network module similar to the module of FIG. 3. The module may comprise, for example, an input layer 40, a fuzzy input layer 44, a rule base layer 48, a fuzzy output layer 52 and an output layer 56. The rule base layer 48 includes, for example, rule nodes r₁, r₂ and r₃ indicated at 90, 92 and 94 respectively.

For the aggregation of these three rule nodes r₁, r₂ and r₃ the following two aggregation strategies can be used to calculate the W1 connections of the new aggregated rule node r_(agg) (the same formulae are used to calculate the W2 connections):

-   as a geometrical centre of the three nodes:

    W1(r_(agg)) = (W1(r₁) + W1(r₂) + W1(r₃))/3  (6)

-   as a weighted statistical centre:

    W1(r_(agg)) = (W1(r₁).Nex(r₁) + W1(r₂).Nex(r₂) + W1(r₃).Nex(r₃))/Nsum  (7)

    Nex(r_(agg)) = Nsum = Nex(r₁) + Nex(r₂) + Nex(r₃);  (8)

    Rr_(agg) = D(W1(r_(agg)), W1(r_(j))) + Rj <= Rmax;  (9)

where r_(j) is the rule node from the three nodes that has a maximum distance from the new node r_(agg) and Rj is its radius of the receptive field. The three rule nodes will aggregate only if the radius of the aggregated node receptive field is less than a pre-defined maximum radius Rmax.
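Both aggregation strategies, together with the radius check of formula (9), can be sketched as follows. Representing each rule node as a dictionary with keys `w1`, `w2`, `nex` and `radius` is an assumption for illustration, as is returning `None` when the nodes may not be aggregated; the check uses the "<=" form of formula (9).

```python
def fuzzy_distance(a, b):
    """Local normalised fuzzy distance, as in formula (1)."""
    den = sum(abs(p + q) for p, q in zip(a, b))
    return sum(abs(p - q) for p, q in zip(a, b)) / den if den else 0.0


def aggregate(nodes, r_max, weighted=True):
    """Merge rule nodes into one, or return None if the radius exceeds r_max."""
    n_sum = sum(n["nex"] for n in nodes)              # formula (8)

    def centre(key):
        dim = len(nodes[0][key])
        if weighted:                                  # formula (7)
            return [sum(n[key][i] * n["nex"] for n in nodes) / n_sum
                    for i in range(dim)]
        return [sum(n[key][i] for n in nodes) / len(nodes)   # formula (6)
                for i in range(dim)]

    w1, w2 = centre("w1"), centre("w2")
    # Formula (9): farthest member's distance plus its own radius.
    farthest = max(nodes, key=lambda n: fuzzy_distance(w1, n["w1"]))
    radius = fuzzy_distance(w1, farthest["w1"]) + farthest["radius"]
    if radius > r_max:
        return None
    return {"w1": w1, "w2": w2, "nex": n_sum, "radius": radius}


nodes = [
    {"w1": [1.0, 0.0], "w2": [1.0, 0.0], "nex": 1, "radius": 0.1},
    {"w1": [0.0, 1.0], "w2": [1.0, 0.0], "nex": 3, "radius": 0.1},
]
merged = aggregate(nodes, r_max=0.9)
```

In this two-node example the weighted centre lies closer to the node with three associated examples, and the same call with a smaller `r_max` of 0.5 refuses the merge.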

FIG. 8 shows an example of aggregation as a geometrical centre of the three nodes whereas FIG. 9 shows aggregation as a weighted statistical centre.

In order for a given node r_(j) to “choose” the other nodes with which it should aggregate, two subsets of nodes are formed: the subset of nodes r_(k) that, if activated to a degree of 1, will produce an output value y′(r_(k)) that differs from y′(r_(j)) by less than the error threshold E, and the subset of nodes that cause output values that differ from y′(r_(j)) by more than the error threshold E. The W2 connections define these subsets. All the rule nodes from the first subset that are closer to r_(j) in the input space, in terms of W1 distance, than the node from the second subset that is closest to r_(j) are aggregated if the calculated radius of the new node r_(agg) is less than the pre-defined limit Rmax for a receptive field, as illustrated in FIG. 9.

Instead of aggregating all the rule nodes that are closer to a rule node r_(j) than the closest node from the other class, it is possible to keep the node from the aggregation pool that is closest to the other class out of the aggregation procedure, as a separate “guard” node, as shown in FIGS. 10, 11, 12 and 13, thus preventing future misclassification in the bordering area between the two classes.

The aggregation of spatially allocated rule nodes is described with reference to FIGS. 10 and 11. Referring to FIG. 10, two distinct sets of rule nodes have been selected and sorted for aggregation, shown generally as 100 and 102 respectively. Referring to FIG. 11, rule node 104 is classified as a guard and is not aggregated. The remaining rule nodes in set 100 are aggregated into new rule node 106. Similarly, rule node 108 is not aggregated with the remaining aggregated rule nodes in set 102 shown at 110. In accordance with the invention, the sensitivity threshold and error threshold of rule nodes 104 and 108 are decreased to increase the activation threshold of these nodes, resulting in aggregated nodes 106 and 110 being activated in preference to guard nodes 104 and 108.

FIGS. 12 and 13 illustrate the same process of aggregation as that described in FIGS. 10 and 11 with the exception that the rule nodes are linearly allocated rather than spatially allocated, as they are in FIGS. 10 and 11.

Aggregation in accordance with the invention is preferably performed after a certain number of examples are presented (parameter N_(agg)) over the whole set of rule nodes.

In a further preferred form of the system, nodes r₁ that are not aggregated may decrease their sensitivity threshold S₁ and increase their radius R₁ by a small coefficient in order for these nodes to have more chances to win the activation competition for the next input data examples and compete with the rest of the nodes.

Through node creation and consecutive aggregation, the preferred neural network module 22 may adjust over time to changes in the data stream and at the same time preserve its generalisation capabilities.

After a certain time (when a certain number of data examples have been presented to the system) some neurons and connections may be pruned. Different pruning rules can be applied for a successful pruning of unnecessary nodes and connections. One of them is given below:

IF (Age(r_(j)) > OLD) AND (the total activation TA(r_(j)) is less than a pruning parameter Pr times Age(r_(j))) THEN prune rule node r_(j),

where Age(r_(j)) is calculated as the number of examples that have been presented to the system after r_(j) was first created; OLD is a pre-defined “age” limit; Pr is a pruning parameter in the range [0,1], and the total activation TA(r_(j)) is calculated as the number of examples for which r_(j) has been the correct winning node (or among the m winning nodes in the m-of-n mode of operation).
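The pruning rule above, using a crisp value for OLD, can be sketched as a filter over the rule nodes. Holding each node's Age and TA in a dictionary with keys `age` and `ta` is an assumption made for illustration.

```python
def prune(nodes, old, pr):
    """Keep only the rule nodes that the pruning rule does not remove.

    A node is pruned when Age > OLD and TA < Pr * Age.
    """
    return [
        node for node in nodes
        if not (node["age"] > old and node["ta"] < pr * node["age"])
    ]


nodes = [
    {"name": "r1", "age": 100, "ta": 5},    # old and rarely the winner
    {"name": "r2", "age": 100, "ta": 60},   # old but frequently the winner
    {"name": "r3", "age": 10, "ta": 0},     # too young to be judged
]
kept = prune(nodes, old=50, pr=0.1)
```

Only r1 is removed: it is older than the limit and has won fewer than 10% of the examples presented since its creation.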

The above pruning rule requires that the fuzzy concepts of OLD, HIGH, etc. are defined in advance. As a partial case, a crisp value can be used, e.g. a node is OLD if it has existed during the evolving of a system for more than p examples. The pruning rule and the way the values for the pruning parameters are defined depend on the application task.

Parameters of each rule node may be either kept fixed during the entire operation of the system, or can be adapted or optimised according to the incoming data. Adaptation may be achieved through the analysis of the behaviour of the system and through a feedback connection from the higher level modules. Genetic algorithms and evolutionary programming techniques can also be applied to optimise the structural and functional parameters of the neural network module 22.

In a further preferred form of the invention, a population of s systems is evolved simultaneously, each system having different parameter values. A certain “window” of incoming data is kept and updated for testing the fitness of each individually evolved system based on a mean square error fitness function. The best system is selected and “multiplied” through small deviations of the parameter values, thus creating the next generation of the population. The process continues indefinitely over time.

In terms of implementing the method and the system in a computer memory, when created, new rule nodes are either spatially or linearly allocated in the computer memory and the actual allocation of nodes could follow one of a number of different strategies as is described below.

One such strategy, as shown in FIG. 14, could be a simple consecutive allocation strategy. Each newly created rule node is allocated in the computer memory next to the previous and to the following rule nodes, in a linear fashion, representing a time order.

Another possible strategy could be a pre-clustered location as shown in FIG. 15. For each output fuzzy node, there is a pre-defined location in the computer memory where the rule nodes supporting this pre-defined concept are located. At the centre of this area, the nodes that fully support this concept are placed. Every new rule node's location is defined based on the fuzzy output error and the similarity with other nodes. In a nearest activated node insertion strategy, a new rule node is placed nearest to the most highly activated node whose activation is still less than its sensitivity threshold. The side (left or right) where the new node is inserted is defined by the higher activation of the two neighbouring nodes.

A further strategy could include the pre-clustered location described above, further including temporal feedback connections between different parts of the computer memory loci, as shown in FIG. 16. New connections are set that link consecutively activated rule nodes through using the short term memory and the links established through the W3 weight matrix. This will allow the neural network module 22 to repeat a sequence of data points starting from a certain point and not necessarily from the beginning.

A further strategy could include the additional feature that new connections are established between rule nodes from different neural network modules that become activated simultaneously, as shown in FIG. 17. This feature would enable the system to learn a correlation between conceptually different variables, for example the correlation between speech sound and lip movement.

An important feature of the adaptive learning system and method described above is that learning involves local element tuning. Only one rule node (or a small number, if the system operates in m-of-n mode) will be updated for each data example, or alternatively only one rule node will be created. This speeds up the learning procedure, particularly where linear activation functions are used in the neural network modules. A further advantage is that learning a new data example does not cause forgetting of old examples. Furthermore, new input and new output variables may be added during the learning process, thereby making the adaptive learning system more flexible to accommodate new information without disregarding already learned information.

The use of membership functions, membership degrees and normalised local fuzzy distance enables the system to deal with missing attribute values. In such cases, the membership degrees of all membership functions will be 0.5, indicating that the value, if it existed, may belong equally to them. Preference, in terms of which fuzzy membership functions the missing value may belong to, can also be represented through assigning appropriate membership degrees.
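The handling of a missing attribute value can be sketched as a thin wrapper around whatever fuzzification function is in use. Representing a missing value as `None`, and the names `fuzzify_or_missing` and `fuzzify`, are assumptions for illustration.

```python
def fuzzify_or_missing(value, fuzzify, n_mfs):
    """Return membership degrees, using 0.5 for every MF if value is missing."""
    if value is None:
        # The missing value may belong equally to every membership function.
        return [0.5] * n_mfs
    return fuzzify(value)


# Hypothetical crisp fuzzifier with three membership functions.
def three_way(v):
    return [1.0, 0.0, 0.0] if v < 0.33 else [0.0, 1.0, 0.0] if v < 0.66 else [0.0, 0.0, 1.0]


x1_f = fuzzify_or_missing(0.2, three_way, 3)    # present value
x2_f = fuzzify_or_missing(None, three_way, 3)   # missing value
x_f = x1_f + x2_f                               # fuzzy input vector
```

The resulting fuzzy input vector can be passed to the distance measure unchanged, so an example with a missing attribute can still be matched against existing rule nodes.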

The preferred supervised learning algorithms of the invention enable the system to continually evolve and learn when a new input-output pair of data becomes available. This is known as an active mode of learning. In another mode, passive learning, learning is performed when there is no input pattern presented. Passive learning could be conducted after an initial learning. During passive learning, existing connections that store previously fed input patterns are used as an “echo” to reiterate the learning process. This type of learning could be applied in case of a short presentation time of the data, when only a small portion of the data is learned in one pass online mode and then the training is refined through the echo learning method. The stored patterns in the W1 connection weights can be used as input vectors for the system refinement, with the W2 patterns indicating what the outputs will be.

Two preferred supervised learning algorithms are described below. The learning algorithms differ in their weight adjustment formulae.

The first learning algorithm is set out below:

Set initial values for the system parameters: number of membership functions; initial sensitivity thresholds (default S_(j)=0.9); error threshold E; aggregation parameter N_(agg), the number of consecutive examples after which aggregation is performed; pruning parameters OLD and Pr; a value for m (in m-of-n mode); maximum radius limit Rmax; thresholds T₁ and T₂ for rule extraction.

Set the first rule node r₀ to memorise the first example (x,y): W1(r₀)=x_(f) and W2(r₀)=y_(f);
Loop over presentations of new input-output pairs (x,y) {
  Evaluate the local normalised fuzzy distance D between x_(f) and the existing rule node connections W1 (formula (1));
  Calculate the activation A1 of the rule node layer;
  Find the closest rule node r_(k) (or the closest m rule nodes in case of m-of-n mode) to the fuzzy input vector x_(f) for which A1(r_(k)) >= S_(k) (the sensitivity threshold for the node r_(k));
  if there is no such node, create a new rule node for (x_(f),y_(f))
  else
    Find the activation of the fuzzy output layer A2=W2.A1(1−D(W1,x_(f))) and the normalised output error Err=||y−y′||/Nout;
    if Err > E, create a new rule node to accommodate the current example (x_(f),y_(f))
    else update W1(r_(k)) and W2(r_(k)) according to (2) and (3) (in case of an m-of-n system, update all the m rule nodes with the highest A1 activation);
  Apply the aggregation procedure of rule nodes after each group of N_(agg) examples is presented;
  Update the values for the rule node r_(k) parameters S_(k), R_(k), Age(r_(k)), TA(r_(k));
  Prune rule nodes if necessary, as defined by the pruning parameters;
  Extract rules from the rule nodes
}
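The core of the loop above (create a node, or update the single winning node) can be sketched as follows. This is a simplified illustration rather than the full algorithm: aggregation, pruning, rule extraction, per-node thresholds and the m-of-n mode are omitted. Formulae (1) to (3) are defined elsewhere in this specification, so the distance in `fuzzy_distance` and the two update rules in `evolve` are stand-ins assumed for the purpose of the example, as are the function names and the default values of `S`, `E` and `lr`.

```python
import numpy as np

def fuzzy_distance(a, b):
    # ASSUMPTION: stand-in for formula (1); sum of absolute differences
    # divided by the sum of both fuzzy vectors (which must not be all zero).
    return float(np.abs(a - b).sum() / (a + b).sum())

def evolve(pairs, S=0.9, E=0.1, lr=0.1):
    """One-pass learning over fuzzified (xf, yf) pairs with a single winner."""
    W1, W2 = [], []  # one row per rule node
    for xf, yf in pairs:
        xf = np.asarray(xf, dtype=float)
        yf = np.asarray(yf, dtype=float)
        if W1:
            # Linear activation of every rule node: A1 = 1 - D.
            A1 = np.array([1.0 - fuzzy_distance(xf, w) for w in W1])
            k = int(np.argmax(A1))
            if A1[k] >= S:
                A2 = W2[k] * A1[k]  # fuzzy output of the winning node
                err = float(np.abs(yf - A2).sum()) / yf.size
                if err <= E:
                    # ASSUMPTION: stand-ins for formulae (2) and (3); the sign
                    # is chosen so that the update moves towards the example.
                    W1[k] = W1[k] + lr * (xf - W1[k])
                    W2[k] = W2[k] + lr * (yf - A2) * A1[k]
                    continue
        # No node is sensitive enough, or the error is too large: new node.
        W1.append(xf.copy())
        W2.append(yf.copy())
    return np.array(W1), np.array(W2)
```

Because only the winning node is touched, an example that falls far from every existing node adds a node without altering what the other nodes have stored.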

A modified version of the above algorithm is obtained when the number of winning rule nodes is chosen to be not 1 but m>1 (by default m=3). This mode is called “m-of-n”.

The second learning algorithm differs from the first learning algorithm in the weight adjustment formula for W2, as follows:

$W2(r_j^{(2)}) = W2(r_j^{(2)}) + l_j \cdot (A2 - y_f) \cdot A1(r_j^{(2)})$  (11)

This means that after the first propagation of the input vector and calculation of the error Err, if the weights are going to be adjusted, the W1 weights are adjusted first using equation (2) above. The input vector x is then propagated again through the already adjusted rule node r_(j), now at its new position r_(j)⁽²⁾ in the input space, a new error Err=(A2−y_(f)) is calculated, and after that the W2 weights of the rule node r_(j) are adjusted. This is a finer weight adjustment than the adjustment in the first algorithm. It may make a difference in learning short sequences, but for learning longer sequences it may not produce any difference from the results obtained through the simpler and faster first algorithm.

In addition to supervised learning, the system is also preferably arranged to perform unsupervised learning in which it is assumed that there are no desired output values available and the system evolves its rule nodes from the input space. Node allocation is based only on the sensitivity thresholds S_(j) and on the learning rates l_(j). If a new data item d activates a certain rule node (or nodes) above the level of its parameter S_(j), then this rule node (or the one with the highest activation) is adjusted to accommodate the new data item according to equation (2) above, or alternatively a new rule node is created. The unsupervised learning method of the invention is based on the steps described above as part of the supervised learning method when only the input vector x is available for the current input data item d.

Both the supervised and the unsupervised learning methods for the system are based on the same principles of building the W1 layer of connections. Either class of method could be applied on an evolving system so that if there are known output values, the system will use a supervised learning method, otherwise it will apply the unsupervised learning method on the same structure. For example, after a neural network module has been evolved in an unsupervised way from spoken-word input data, the system may then use data labelled with the appropriate phoneme labels to continue the learning process, now in a supervised mode.

The preferred system may also perform learning from output hints, or through reinforcement learning, in addition to the unsupervised or supervised learning. This is the case when the exact, desired output values do not become known for the purpose of adjusting the W2 connection weights. Instead, fuzzy hints F given in fuzzy linguistic labels that are used in the fuzzy output space may be given as a feedback, e.g. “low output value is the desired one” while the output value produced by the system is “very low”. The system then calculates the fuzzy output error Errf=A2−F and then adjusts the connections W2 through formula (3).

The preferred system may also perform inference and have the ability to generalise on new input data. The inference method is part of the learning method when only the input vector x is propagated through the system. The system calculates the winner, or m winners, as follows: a winning rule node r for an input vector x is the node with (i) the highest activation A1(r) among the rule nodes for which (ii):

$D(x, W1(r)) \le R_r$  (12)

where D(x, W1(r)) is the fuzzy normalised distance between x and W1(r), and R_(r) is the radius of the rule node r. If there is no rule node that satisfies condition (ii) for the current input vector x, only condition (i) is used to select the winner.
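The winner selection of conditions (i) and (ii) can be sketched as below for the single-winner case. The function name `select_winner` and the array-based interface are assumptions made for illustration; the activations, distances and radii are taken to have been computed already.

```python
import numpy as np

def select_winner(A1, D, R):
    """Index of the winning rule node.

    A1: activations, D: fuzzy normalised distances to the input vector,
    R: radii, one entry per rule node (hypothetical interface).
    """
    A1, D, R = (np.asarray(v, dtype=float) for v in (A1, D, R))
    inside = D <= R  # condition (ii), equation (12)
    if inside.any():
        candidates = np.flatnonzero(inside)
        # Condition (i) restricted to nodes that satisfy condition (ii).
        return int(candidates[np.argmax(A1[candidates])])
    # No node covers the input: fall back to condition (i) alone.
    return int(np.argmax(A1))
```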

In a preferred form of the invention with reference to FIG. 3 above, a temporal layer 60 of temporal nodes 62 captures temporal dependencies between consecutive data examples. If the winning rule node at the moment (t−1), to which the input data vector at the moment (t−1) is associated, is r_(max)^((t−1)) and the winning node at the moment t is r_(max)^((t)), then a link between the two nodes is established as follows:

$W3(r_{max}^{(t-1)}, r_{max}^{(t)}) = W3(r_{max}^{(t-1)}, r_{max}^{(t)}) + l_3 \cdot A1(r_{max}^{(t-1)}) \cdot A1(r_{max}^{(t)})$  (13)

where A1(r^((t))) denotes the activation of a rule node r at a time moment (t) and l₃ defines the degree to which the neural network module 22 associates links between rule nodes that include consecutive data examples. If l₃=0, no temporal associations are learned in the structure and the temporal layer 60 is effectively removed from the neural network module 22.
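Equation (13) amounts to a single accumulation into a matrix of temporal links. The sketch below assumes, for illustration only, that W3 is held as a square array indexed by rule node number and that the function is named `update_temporal_link`.

```python
import numpy as np

def update_temporal_link(W3, prev_winner, winner, a1_prev, a1_now, l3=0.1):
    """Strengthen the link from the previous winner to the current winner.

    W3 is modified in place and returned; l3 = 0 leaves it unchanged, which
    corresponds to the temporal layer being switched off.
    """
    W3[prev_winner, winner] += l3 * a1_prev * a1_now
    return W3
```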

The learned temporal associations could be used to support the activation of rule nodes based on temporal pattern similarity. Here, temporal dependencies are learned through establishing structural links. These dependencies can be further investigated and enhanced through synaptic analysis, at the synaptic memory level, rather than through neuronal activation analysis at the behavioural level. The ratio of spatial similarity to temporal correlation can be balanced for different applications through two parameters S_(s) and T_(c), such that the activation of a rule node r for a new data example d=(x,y) is defined through the following vector operations:

$A1(r) = \left| 1 - S_s \cdot D(W1(r), x_f) + T_c \cdot W3(r_{max}^{(t-1)}, r) \right|_{[0,1]}$  (14)

where |.|_([0,1]) is a bounded operation in the interval [0,1], and r_(max)^((t−1)) is the winning neuron at the previous time moment. Here temporal connections can be given a higher importance in order to tolerate a higher distance in time for time-dependent input vectors. If T_(c)=0, then temporal links are excluded from the functioning of the system.
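Equation (14) can be read as a linear activation that is clipped to [0, 1]. A minimal sketch follows; the function and parameter names are hypothetical, and the default values of `Ss` and `Tc` are chosen only so that the default call reduces to the purely spatial activation 1 − D.

```python
def rule_activation(distance, w3_from_prev_winner=0.0, Ss=1.0, Tc=0.0):
    """Activation A1(r) of one rule node per equation (14).

    distance: D(W1(r), x_f); w3_from_prev_winner: W3 link from the previous
    winning node to r. The result is bounded to the interval [0, 1].
    """
    raw = 1.0 - Ss * distance + Tc * w3_from_prev_winner
    return min(1.0, max(0.0, raw))
```

With `Tc=0.0` the temporal link has no effect, matching the statement that temporal links are then excluded from the functioning of the system.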

The system is arranged to learn a complex chaotic function through on-line evolving from one-pass data propagation. The system is also arranged to learn time series that change their dynamics through time and never repeat the same patterns. Time series processes with changing dynamics could be of different origins, for example biological, environmental, industrial process control, or financial. The system could also be used for off-line training and testing similar to other standard neural network techniques.

An example of learning a complex chaotic function is described with reference to FIGS. 18A and 18B. Here, the system is used with the Mackey-Glass chaotic time series data generated through the Mackey-Glass time delay differential equation:

$\begin{matrix}{\frac{dx(t)}{dt} = {\frac{a\,x(t - \tau)}{1 + x^{10}(t - \tau)} - b\,x(t)}} & (15)\end{matrix}$

This series behaves as a chaotic time series for some values of the parameters x(0) and τ. Here, x(0)=1.2, τ=17, a=0.2, b=0.1 and x(t)=0 for t<0. The input-output data for evolving the system from the Mackey-Glass time series data has the input vector [x(t), x(t−6), x(t−12), x(t−18)] and the output vector [x(t+6)]. The task is to predict the future value x(t+6) from four points spaced at six time intervals in the past.
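The data preparation described above can be sketched as follows, using the stated values x(0)=1.2, τ=17, a=0.2, b=0.1 and x(t)=0 for t<0. The forward Euler integration, the step size `dt=0.1` and the function names `mackey_glass` and `make_pairs` are assumptions made for illustration; the integration scheme used to produce FIGS. 18A and 18B is not stated in this passage.

```python
import numpy as np

def mackey_glass(n_steps, tau=17, a=0.2, b=0.1, x0=1.2, dt=0.1):
    """Integrate equation (15) and return samples at t = 0, 1, ..., n_steps.

    ASSUMPTION: forward Euler with step dt; x(t) = 0 for t < 0.
    """
    per_unit = int(round(1 / dt))      # integration steps per time unit
    delay = tau * per_unit             # delay expressed in integration steps
    n = n_steps * per_unit
    x = np.zeros(n + 1)
    x[0] = x0
    for k in range(n):
        x_tau = x[k - delay] if k >= delay else 0.0
        x[k + 1] = x[k] + dt * (a * x_tau / (1.0 + x_tau ** 10) - b * x[k])
    return x[::per_unit]

def make_pairs(series, lags=(0, 6, 12, 18), horizon=6):
    """Inputs [x(t), x(t-6), x(t-12), x(t-18)] and targets x(t+6)."""
    start, stop = max(lags), len(series) - horizon
    X = np.array([[series[t - lag] for lag in lags] for t in range(start, stop)])
    y = np.array([series[t + horizon] for t in range(start, stop)])
    return X, y
```

Each row of `X` together with the matching entry of `y` forms one input-output pair to be presented to the system.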

For the example, values for the system parameters are initially set as follows:

S=0.92, E=0.08, l=0.005, aggregation threshold Rmax=0.15 and rule extraction thresholds T₁=T₂=0.1. Aggregation is performed after each consecutive group of N_(agg)=50 examples is presented.

Experimental results of the on-line evolving of the system are shown in FIGS. 18A and 18B. In particular, the figures show the desired versus predicted six-steps-ahead values obtained through one-pass on-line learning; the absolute error, the local on-line RMSE (LRMSE) and the local on-line NDEI (LNDEI) over time, as described below; the number of rule nodes created and aggregated over time; and a plot of the input data vectors, shown as circles, and the evolved rule nodes (the W1 connection weights), shown as crosses, projected in the two-dimensional input space of the first two input variables x(t) and x(t−6). It can be seen from FIGS. 18A and 18B that the number of rule nodes is optimised after every 50 examples are presented. The rule nodes are located in the input and the output problem spaces so that they represent cluster centres of the input data that have similar output values subject to an error difference E.

The generalisation error of a neural network module on a next new input vector (or vectors) from the input stream, calculated through the evolving process, is called the local on-line generalisation error. The local on-line generalisation error at the moment t, for example, when the input vector is x(t) and the output vector calculated by the evolved module is y(t)′, is expressed as Err(t)=y(t)−y(t)′. The local on-line root mean square error LRMSE(t) and the local on-line non-dimensional error index LNDEI(t) can be calculated at each time moment t as:

$LRMSE(t) = \sqrt{\frac{\sum_{i=1}^{t} Err(i)^2}{t}}; \quad LNDEI(t) = \frac{LRMSE(t)}{std(y(1):y(t))}$  (16)

where std(y(1):y(t)) is the standard deviation of the output data points from 1 to t.
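The running error measures of equation (16) can be computed for every time moment in one pass, as in the sketch below. The function name `local_online_errors` is hypothetical, and the use of the population standard deviation (NumPy's default, `ddof=0`) is an assumption, since the text does not say which form of the standard deviation is meant.

```python
import numpy as np

def local_online_errors(y, y_pred):
    """Return (LRMSE, LNDEI) arrays, one value per time moment t = 1..n."""
    y = np.asarray(y, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y - y_pred                              # Err(t) = y(t) - y(t)'
    t = np.arange(1, len(y) + 1)
    lrmse = np.sqrt(np.cumsum(err ** 2) / t)
    # ASSUMPTION: population standard deviation of y(1)..y(t).
    std = np.array([np.std(y[:k]) for k in t])
    with np.errstate(divide="ignore", invalid="ignore"):
        # LNDEI is undefined while the outputs seen so far are all equal.
        lndei = np.where(std > 0, lrmse / std, np.nan)
    return lrmse, lndei
```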

For the chosen values of the parameters, 16 rule nodes were evolved, each of them represented as one rule. Three of these rules are shown in FIG. 19, namely Rule 1, Rule 2 and Rule 16. These rules and the system inference mechanism define a system that is equivalent to the above equation (15) in terms of the chosen input and output variables, subject to the calculated error.

As more input data is entered, after a certain time moment the LRMSE and LNDEI converge to constant values subject to a small error; in the example from FIG. 19, LRMSE=0.043 and LNDEI=0.191. Generally speaking, in the case of a compact and bounded problem space the error can be made sufficiently small subject to appropriate selection of the parameter values for the system and the initial data stream. In the experiment above the chosen error tolerance was comparatively high, but the resulting system was compact. If the chosen error threshold E had been smaller (e.g. 0.05 or 0.02), more rule nodes would have been evolved and better prediction accuracy could have been achieved. Different neural network modules have different optimal parameter values, which depend on the task (e.g. time series prediction, classification).

A further example has been conducted in which the system has been used for off-line training and testing. The following parameter values are initially set before the system is evolved, namely MF=5, S=0.92, E=0.02, m=3, l=0.005. The system is evolved on the first 500 data examples from the same Mackey-Glass time series as in the example above for one pass of learning. FIG. 20 shows the desired versus the predicted on-line values of the time series. After the system is evolved, it is tested for global generalisation on the second 500 examples. FIG. 21 shows the desired values versus the values predicted by the system in an off-line mode.

In a general case, the global generalisation root mean square error (RMSE) and the non-dimensional error index (NDEI) are evaluated on a set of p new examples from the problem space as follows:

$RMSE = \sqrt{\frac{\sum_{i=1}^{p} (y_i - y_i')^2}{p}}; \quad NDEI = \frac{RMSE}{std(1:p)}$  (17)

where std(1:p) is the standard deviation of the data from 1 to p in the test set. In this example the evaluated RMSE is 0.01 and the NDEI is 0.046. After having evolved the system on a small but representative part of the whole problem space, its global generalisation error is sufficiently minimised.

The system is also tested for on-line test error on the test data while further training on it is performed. The on-line local test error is slightly smaller.

In one experimental application the preferred system can be used for life-long unsupervised learning from a continuous stream of new data. Such is the case of learning new sounds of new languages or new accents unheard before. One experiment is described with reference to FIGS. 22 and 23. The system is presented with the acoustic features of a spoken word “eight” having a phonemic representation of /silence/ /ei/ /t/ /silence/. In the experimental results shown in FIG. 22, three time lags of 26 mel scale coefficients taken from a window of 12 ms of the speech signal, with an overlap of 50%, are used to form 78-element input vectors. The input vectors are plotted over time as shown in FIG. 23.
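The construction of the 78-element input vectors can be sketched as a stacking of lagged frames. The sketch assumes that the three time lags are three consecutive frames and that the 26 mel scale coefficients per frame have already been computed; the function name `stack_time_lags` is hypothetical.

```python
import numpy as np

def stack_time_lags(frames, n_lags=3):
    """Concatenate n_lags consecutive frames into one input vector per step.

    frames: array of shape (n_frames, n_coefficients), with at least n_lags rows.
    Returns an array of shape (n_frames - n_lags + 1, n_lags * n_coefficients).
    """
    frames = np.asarray(frames, dtype=float)
    n = len(frames) - n_lags + 1
    return np.hstack([frames[i:i + n] for i in range(n_lags)])
```

For 26 coefficients per frame and three lags, each returned row has 3 × 26 = 78 elements.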

Each new input vector from the spoken word is either associated with an existing rule node that is modified to accommodate this data, or a new rule node is created. The rule nodes are aggregated at regular intervals, which reduces the number of the nodes placed at the centres of the data clusters. After the whole word is presented, the aggregated rule nodes represent the centres of the anticipated phoneme clusters without the concept of phonemes being introduced to the system.

FIGS. 22 and 23 show clearly that three rule nodes were evolved after aggregation that represent the input data. For example, frames 0 to 53 indicated at 120 and frames 96 to 170 indicated at 122 are allocated to rule node 1, which represents the phoneme /silence/. Frames 56 to 78 indicated at 124 are allocated to rule node 2, which represents the phoneme /ei/. Frames 85 to 91 indicated at 126 are allocated to rule node 3, which represents the phoneme /t/. The remaining frames represent transitional states. For example, frames 54 to 55 represent the transition between /silence/ and /ei/. Frames 79 to 84 represent the transition between /ei/ and /t/. Frames 92 to 96 represent the transition between /t/ and /silence/. These frames are allocated to some of the closest rule nodes in the input space. If a higher sensitivity threshold had been used, this would have resulted in additional rule nodes being evolved to represent these short transitional sounds.

When further pronunciations of the word “eight” or other words are presented to the unsupervised system, the system refines the phoneme regions and the phoneme rule nodes or creates new phoneme rule nodes. The unsupervised learning method described above permits experimenting with different strategies of learning, namely increased sensitivity over time, decreased sensitivity over time and using forgetting in the process of learning. It also permits experimenting with several languages in a multilingual system.

In an experimental setting a system is evolved on both spoken words from New Zealand English and spoken words from Maori. Some of the evolved phoneme rule nodes are shared between the acoustical representations of the languages, as illustrated in FIG. 24, where the evolved rule nodes as well as a trajectory of the spoken word ‘zoo’ are plotted in the two-dimensional space of the first two principal components of the input acoustic space. The rule nodes in the evolved system form a compact representation of the acoustic space of the two languages presented to the system. The system can continuously be trained on further words of the two languages or more languages, thus refining the acoustic space representation with the use of the sharing sounds (phonemes) principle.

The system has been subject to an experiment concerned with the task of on-line time series prediction of the Mackey-Glass data. Here the standard CMU benchmark format of the time series is used. The data is generated with τ=17 using a second order Runge-Kutta method with a step size of 0.1, with four inputs, namely x(t), x(t−6), x(t−12) and x(t−18), and one output, namely x(t+85). Training data is from t=200 to t=3200 while test data is from t=5000 to t=5500. All 3000 training data sets were used to evolve two types of neural network modules.

For the purposes of the first and second learning algorithms described above, the following initial values of the parameters were chosen: MF=3, S=0.7, E=0.02, m=3, l=0.02, Rmax=0.2, N_(agg)=100. The number of the centres and the local on-line LNDEI are calculated and compared with the results for the RAN model as described in Platt, J., “A resource allocating network for function interpolation”, Neural Computation 3, 213-225 (1991), and its modifications.

The results are shown in FIG. 25. The two modifications of the system result in a smaller on-line error than the other methods and in a reasonable number of rule nodes. The two learning algorithms are shown as System-su and System-dp.

As the system preferably uses linear equations for calculating the activation of the rule nodes, rather than Gaussian functions and exponential functions as in the RAN model, the present system learning procedure is faster than the learning procedure in the RAN model and its modifications. The system also produces better on-line generalisation, which is a result of more accurate node allocation during the learning process. This is in addition to the advantageous knowledge representation features of the preferred system, which include clustering of the input space, rule extraction and rule insertion.

The system has also been subject to a further experiment dealing with a classification task on case study data of spoken digits. The task is recognition of speaker independent pronunciations of English digits from the Otago corpus database (http://kel.otago.ac.nz/hyspeech/corpus/). Seventeen speakers (12 males and 5 females) are used for training and a further 17 speakers (12 males and 5 females) are used for off-line testing. Each speaker utters 30 instances of English digits during a recording session in a quiet room, resulting in clean data, for a total of 510 training and 510 testing utterances. Eight mel frequency scale cepstrum coefficients (MFSCC) and log-energy are used as acoustic features. In order to assess the performance of the system in this application, a comparison with Learning Vector Quantisation (LVQ) is made. Clean training speech is used to train both LVQ and the present system. Office noise is introduced to the test speech data to evaluate the behaviour of the recognition systems in a noisy environment, with a signal-to-noise ratio of 10 dB.

The classification off-line test accuracy for the LVQ model and the present system, and also the local on-line test accuracy for the system, are evaluated and shown in FIG. 26.

The LVQ model has the following parameter values: 396 code-book vectors and 15840 training iterations. The present system has the following parameter values: one training iteration, 3 MFs, 157 rule nodes, and initial values S=0.9, E=0.1, l=0.01. The maximum radius is Rmax=0.2 and the number of examples for aggregation is N_(agg)=100.

The results show that the present system with off-line learning and testing on new data performs much better than the LVQ method, as shown in FIG. 26. As the present system allows for continuous training on new data, further testing and also training of the system on the test data in an on-line mode leads to a significant improvement of accuracy.

The system has also been subject to a further experiment dealing with a classification task on bio-informatics case study data obtained from the machine learning database repository at the University of California at Irvine. It contains primate splice-junction gene sequences for the identification of splice site boundaries within these sequences. In eukaryotes the genes that code for proteins are split into coding regions (exons) and noncoding regions (introns) of the DNA sequence at defined boundaries, the so-called splice sites. The data set consists of 3190 DNA sequences which are 60 nucleotides long and classified as either an exon-intron boundary (EI), an intron-exon boundary (IE) or a non-splice site (N). The system uses 2 MF and a four bit encoding scheme for the bases.

After training the system on existing data the system is able to identify potential splice sites within new sequences. Using a sliding window of 60 bases to cover the entire sequence being examined, the boundaries are identified as EI, IE, or N. A score is given to each boundary identified that represents the likelihood that the identified boundary has been identified correctly. The system can be continuously trained on new known data sequences, thus improving its performance on unknown data sequences. At any time of the functioning of the system, knowledge can be extracted from it in the form of semantically meaningful rules that describe important biological relationships. Some of the extracted rules with a rule extraction threshold T1=T2=0.7 are further simplified, formatted and presented in a way that can be interpreted by the user, as shown in FIG. 27. Using different rule extraction thresholds would allow extraction of different sets of rules that have different levels of abstraction, thus allowing for a better understanding of the gene sequences.
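The four bit encoding of the bases and the 60-base sliding window can be sketched as follows. The particular bit patterns in `BASE_CODE` (one bit per base) are an assumption, since the description states only that a four bit scheme is used; the names `encode` and `sliding_windows` are hypothetical, and ambiguity symbols other than A, C, G and T are not handled.

```python
import numpy as np

# ASSUMPTION: one-bit-per-base patterns; the text only says "four bit encoding".
BASE_CODE = {
    "A": (1, 0, 0, 0),
    "C": (0, 1, 0, 0),
    "G": (0, 0, 1, 0),
    "T": (0, 0, 0, 1),
}

def encode(sequence):
    """Encode a DNA string as a flat vector with four inputs per base."""
    return np.array([bit for base in sequence for bit in BASE_CODE[base]],
                    dtype=float)

def sliding_windows(sequence, width=60):
    """Yield (start position, encoded window) for every window of `width` bases."""
    for start in range(len(sequence) - width + 1):
        yield start, encode(sequence[start:start + width])
```

Each encoded window has 60 × 4 = 240 inputs and would be passed to the trained system, which assigns it to EI, IE or N.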

The system has also been subject to a further experiment dealing with a classification task on bio-informatics case study data, namely a data set of 72 classification examples for leukemia cancer disease. The data set consists of two classes and a large input space: the expression values of 7,129 genes monitored by Affymetrix arrays (Golub et al.). The two types of leukemia are acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL).

The task is twofold: 1) finding a set of genes distinguishing ALL and AML, and 2) constructing a classifier based on the expression of these genes, allowing for new data to be entered to the system once they have been made available. The system accommodates or adapts to this data, improving the classification results. The system is evolved through one-pass training on each consecutive example and testing it on the next one.

During the process of on-line evolving the system learns each example and then attempts to predict the class of the next one. Here the system continually evolves with new examples accommodated as they become available. At any time of the system operation, rules that explain which genes are more closely related to each of the classes can be extracted. FIG. 28 shows two of the extracted rules after the initial 72 examples are learned by the system. The rules are “local” and each of them has the meaning of the dominating rule in a particular cluster of the input space.

The system in an on-line learning mode could be used as a building block for creating adaptive speech recognition systems that are based on an evolving connectionist framework. Such systems would be able to adapt to new speakers and new accents, and add new words to their dictionaries at any time of their operation.

Possible applications of the invention include adaptive speech recognition in a noisy environment, adaptive spoken language evolving systems, adaptive process control, adaptive robot control, adaptive knowledge based systems for learning genetic information, adaptive agents on the Internet, adaptive systems for on-line decision making on financial and economic data, adaptive automatic vehicle driving systems that learn to navigate in a new environment (cars, helicopters, etc.), and classifying bio-informatic data.

The foregoing describes the invention including preferred forms thereof. Alterations and modifications as will be obvious to those skilled in the art are intended to be incorporated within the scope hereof, as defined by the accompanying claims.

1. A neural network program embodied on a computer-readable medium comprising: an input layer comprising one or more input nodes arranged to receive input data; a rule base layer comprising one or more rule nodes; an output layer comprising one or more output nodes, each rule node in the rule base layer having a minimum activation threshold and each rule node arranged to be activated when input data satisfies the minimum activation threshold of the rule node; and an adaptive component arranged to aggregate selected two or more rule nodes in the rule base layer based on the input data.
2. The neural network program as claimed in claim 1, wherein the parameters of the activation threshold of each rule node activated by input data are adjusted based on the input data.
3. The neural network program as claimed in claim 2, wherein the adaptive component is arranged to aggregate two or more rule nodes based on the magnitude of activation when activated by input data.
4. The neural network program as claimed in claim 2, wherein the parameters of activation threshold of each rule node are adjusted based at least partially on new input data.
5. The neural network program as claimed in claim 2, further comprising input data stored in a memory, the parameters of the activation threshold of each rule node arranged to be adjusted based at least partially on the stored input data.
6. The neural network program as claimed in claim 1, wherein each rule node is assigned a magnitude of activation when activated by input data.
7. The neural network program as claimed in claim 1, wherein the adaptive component is arranged to increase the minimum activation threshold of one or more rule nodes not selected for aggregation.
8. The neural network program as claimed in claim 1, wherein the parameters of the activation threshold of each rule node activated by input data are adjusted based on both the input data and desired output data.
9. The neural network program as claimed in claim 1, wherein the adaptive component is arranged to insert new rule nodes into the rule base layer.
10. The neural network program as claimed in claim 1 wherein the adaptive component is arranged to extract rules from the rule base layer.
11. The neural network program as claimed in claim 1, further comprising a fuzzy input layer comprising one or more fuzzy input nodes arranged to transform input node values for use by the rule base layer.
12. The neural network program as claimed in claim 1, further comprising a fuzzy output layer comprising one or more fuzzy output nodes arranged to transform data output from the rule base layer.
13. An adaptive learning program embodied on a computer readable medium comprising more than one neural network program each neural network program comprising: an input layer comprising one or more input nodes arranged to receive input data; a rule base layer comprising one or more rule nodes, each rule node in the rule base layer having a minimum activation threshold and each rule node arranged to be activated when input data satisfies the minimum activation threshold of the rule node; an output layer comprising one or more output nodes; and an adaptive component arranged to aggregate selected two or more rule nodes in the rule base layer based on the input data.
14. A neural network module comprising: an input layer comprising one or more input nodes arranged to receive input data; a rule base layer comprising one or more rule nodes, each rule node having a minimum activation threshold, each rule node configured to be activated when input data satisfies the minimum activation threshold of the rule node; an output layer comprising one or more output nodes; and an adaptation component arranged to aggregate selected two or more rule nodes in the rule base layer based on the input data, and to increase the minimum activation threshold of one or more rule nodes not selected for aggregation.
15. The neural network module as claimed in claim 14, wherein the parameters of the activation threshold of each rule node activated by input data are adjusted based on the input data.
16. The neural network module as claimed in claim 14, wherein each rule node is assigned a magnitude of activation when activated by input data.
17. The neural network module as claimed in claim 16, wherein the adaptation component is configured to aggregate two or more rule nodes based on the magnitude of activation when activated by input data.
18. The neural network module as claimed in claim 14, wherein the parameters of the activation threshold of each rule node activated by input data are adjusted based on both the input data and desired output data.
19. The neural network module as claimed in claim 14, wherein the adaptation component is configured to insert new rule nodes into the rule base layer.
20. The neural network module as claimed in claim 14, wherein the adaptation component is configured to extract rules from the rule base layer.
21. The neural network module as claimed in claim 14, wherein the parameters of the activation threshold of each rule node are adjusted based at least partially on new input data.
22. The neural network module as claimed in claim 14, further comprising a memory in which is stored input data, wherein the parameters of the activation threshold of each rule node are adjusted based at least partially on the stored input data.
23. The neural network module as claimed in claim 14, further comprising a fuzzy input layer comprising one or more fuzzy input nodes arranged to transform input node values for use by the rule base layer.
24. The neural network module as claimed in claim 14, further comprising a fuzzy output layer comprising one or more fuzzy output nodes arranged to transform data output from the rule base layer.
25. An adaptive learning system comprising more than one neural network module, each neural network module comprising: an input layer comprising one or more input nodes arranged to receive input data; a rule base layer comprising one or more rule nodes, each rule node having a minimum activation threshold and each rule node configured to be activated when input data satisfies the minimum activation threshold of the rule node; an output layer comprising one or more output nodes; and an adaptation component arranged to aggregate selected two or more rule nodes in the rule base layer based on the input data, and to increase the minimum activation threshold of one or more rule nodes not selected for aggregation.
26. A method of implementing a neural network module comprising the steps of: maintaining in computer memory an input layer comprising one or more input nodes to receive input data; maintaining in computer memory a rule base layer comprising one or more rule nodes; assigning a minimum activation threshold to each rule node in the rule base layer, each rule node being activated when input data satisfies the minimum activation threshold of the rule node; maintaining in computer memory an output layer comprising one or more output nodes; aggregating selected two or more rule nodes in the rule base layer based on the input data; and increasing the minimum activation threshold of one or more rule nodes not selected for aggregation.
27. The method of implementing a neural network module as claimed in claim 26, further comprising the step of adjusting the parameters of the activation threshold of each rule activated by input based on the input data.
28. The method of implementing a neural network module as claimed in claim 26, further comprising the step of assigning to each rule node a magnitude of activation when activated by input data.
29. The method of implementing a neural network module as claimed in claim 26, further comprising the step of aggregating two or more rule nodes based on the magnitude of activation when activated by input data.
30. The method of implementing a neural network module as claimed in claim 26, further comprising the step of adjusting the parameters of the activation threshold of each rule node activated by input data based on both the input data and desired output data.
31. The method of implementing a neural network module as claimed in claim 26, further comprising the step of inserting new rule nodes into the rule base layer.
32. The method of implementing a neural network module as claimed in claim 26, further comprising the step of extracting rules from the rule base layer.
33. The method of implementing a neural network module as claimed in claim 26, further comprising the step of adjusting the parameters of the activation threshold of each rule node based at least partially on new input data.
34. The method of implementing a neural network module as claimed in claim 26, further comprising the steps of maintaining in computer memory input data; and adjusting the parameters of activation threshold of each rule node based at least partially on the stored input data.
35. The method of implementing a neural network module as claimed in claim 26, further comprising the step of maintaining in computer memory a fuzzy input layer comprising one or more fuzzy input nodes to transform input node values for use by the rule base layer.
36. The method of implementing a neural network module as claimed in claim 26, further comprising the steps of maintaining in computer memory a fuzzy output layer comprising one or more fuzzy output nodes to transform data output from the rule base layer.